The current landscape of police brutality in Chicago has community leaders, policy makers, and the public asking what can be done to prevent more deaths. Data plays a central role in shedding light on excessive use of force and racially motivated patterns of violence.
Our foundational research objective is to analyze data on police misconduct and accountability in Chicago. We endeavor to understand the nature and extent of police misconduct and how it correlates with city demographics. Using the datasets provided by the Invisible Institute, we hope to offer insight into how these complaints affect Chicagoans geographically and demographically.
The data for our project were obtained from the Civilian Office of Police Accountability (COPA) via a Freedom of Information Act (FOIA) request made by the Invisible Institute, and comprise a total of seven datasets.
The first step was to load both the data and the required libraries into R.
# install.packages("dplyr")  # one-time install; not run at knit time
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.1 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
library(RMySQL)
## Loading required package: DBI
library(keyring)
library(odbc)
library(readxl)
library(RSQLite)
##
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
##
## isIdCurrent
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
library(dplyr)
library(ggplot2)
library(esquisse)
library(modeldata)
df_accused <- read_excel("accused.xlsx")
df_case_info <- read_excel("case_info.xlsx")
df_civilian_witness <- read_excel("civilian_witness.xlsx")
df_complainant <- read_excel("complainant.xlsx")
df_cpd_witness <- read_excel("cpd_witness.xlsx")
df_investigators <- read_excel("investigators.xlsx")
df_victim <- read_excel("victim.xlsx")
Calling distinct() on the accused data removes fully duplicated rows; counting the result by the primary key (LOG_NO) then shows how often each log number appears.
distinct_df_accused <- distinct(df_accused)
count(distinct_df_accused, LOG_NO) %>% arrange(desc(n))
distinct_df_accused %>%
janitor::clean_names()  # preview snake_case names; not reassigned, since later code uses the original column names
columns_missing_most_data <- distinct_df_accused %>%
summarise(across(everything(), ~ skimr::n_missing(.x))) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "missing_count") %>%
mutate(proportion_missing = round(missing_count / nrow(df_accused), 2)) %>%
arrange(desc(missing_count)) %>%
# Detect columns that were missing more than 50% of the data
filter(proportion_missing > .5) %>%
pull(variable)
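The missing-data logic above can be sanity-checked on a small toy tibble before running it on the full accused data. The column names and values below are invented for illustration, and `sum(is.na(.x))` stands in for `skimr::n_missing()`:

```r
library(dplyr)
library(tidyr)

# Toy data frame: LOG_NO is complete, NO_OF_DAYS is 75% missing
toy <- tibble(
  LOG_NO     = 1:4,
  NO_OF_DAYS = c(NA, NA, NA, 10)
)

mostly_missing <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "missing_count") %>%
  mutate(proportion_missing = missing_count / nrow(toy)) %>%
  # keep only columns missing more than 50% of their values
  filter(proportion_missing > .5) %>%
  pull(variable)

mostly_missing
## [1] "NO_OF_DAYS"
```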
message("Columns missing 50%+ data:")
## Columns missing 50%+ data:
columns_missing_most_data
## [1] "NO_OF_DAYS" "PENALTY_CODE" "EMPLOYEE_DETAIL_UNIT"
The CPD datasets were in First Normal Form, with all entries being single-valued and atomic.
Our first step was to identify the primary keys in the dataset. The LOG_NO was the central key that tied all the datasets together.
The distinct() function showed that there were 84,638 distinct LOG_NO values in the dataset.
distinct(df_accused, LOG_NO)
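A log number can appear on more than one row when a single complaint names several accused officers, which is why distinct LOG_NO values are fewer than total rows. A minimal sketch with invented values:

```r
library(dplyr)

# Toy accused table: complaint 1001 names two officers, 1002 names one
toy_accused <- tibble(
  LOG_NO  = c(1001, 1001, 1002),
  STAR_NO = c(111, 222, 333)
)

# Three rows, but only two distinct complaints
n_distinct(toy_accused$LOG_NO)
## [1] 2

count(toy_accused, LOG_NO) %>% arrange(desc(n))
```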
A second candidate key surfaced in the data in the “PENALTY_ID” column, with a total of 114,344 distinct PENALTY_IDs.
count(distinct_df_accused, PENALTY_ID) %>% arrange(desc(n))
The third candidate key in the data was STAR_NO, which had a total of 14,395 distinct entries.
count(distinct_df_accused, STAR_NO) %>% arrange(desc(n))
For 3NF, we then grouped the data to ensure that non-key columns were not transitively dependent on the primary key. This included identifying: race code, sex code, penalty code, allegation code, investigation code, finding code, and location code.
Race Code
count(distinct_df_accused, RACE_CODE_CD) %>% arrange(desc(n))
Sex Code
count(distinct_df_accused, SEX_CODE_CD) %>% arrange(desc(n))
Penalty Code
count(distinct_df_accused, PENALTY_CODE) %>% arrange(desc(n))
Allegation Code
count(distinct_df_accused, ALLEGATION_CATEGORY, ALLEGATION_CATEGORY_CD) %>% arrange(desc(n))
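Under 3NF, each code/description pair can be split into its own lookup table keyed by the code, since the description depends only on the code and not on LOG_NO. A sketch with invented allegation codes:

```r
library(dplyr)

toy <- tibble(
  LOG_NO                 = c(1, 2, 3),
  ALLEGATION_CATEGORY_CD = c("01A", "01A", "05B"),
  ALLEGATION_CATEGORY    = c("Use of Force", "Use of Force", "Verbal Abuse")
)

# Lookup table: one row per code, removing the transitive dependency
allegation_lookup <- distinct(toy, ALLEGATION_CATEGORY_CD, ALLEGATION_CATEGORY)
allegation_lookup
```

The main table would then keep only LOG_NO and ALLEGATION_CATEGORY_CD, joining back to the lookup when the description is needed.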
We analyzed a total of seven datasets to build the EER model in SQL. The tables were partitioned by aggrieved party, accused/CPD, allegation, and outcome.
count(distinct_df_accused, ALLEGATION_CATEGORY) %>% arrange(desc(n))
select(distinct_df_accused, PENALTY_CODE)
There are a total of 225,523 cases in the accused dataset. When we filtered the data, we found that Chicago Police officers were penalized in 49,221 of those cases, only about 22% of the time.
distinct_df_accused %>%
filter(!is.na(PENALTY_CODE)) %>%
nrow()
## [1] 49221
49221 / 225523 * 100
## [1] 21.82527
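Rather than hand-typing the two counts, the penalized share can be computed in a single summarise() call. A sketch on toy rows (the penalty values below are invented):

```r
library(dplyr)

toy <- tibble(
  PENALTY_CODE = c("SUSPENSION", NA, NA, NA, "REPRIMAND")
)

toy %>%
  summarise(
    n_cases       = n(),
    n_penalized   = sum(!is.na(PENALTY_CODE)),
    pct_penalized = round(100 * n_penalized / n_cases, 1)
  )
# 2 of 5 toy cases (40%) carry a penalty
```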
Roughly 176,000 accused officers had no penalty listed. Of the penalties that were listed, 32,247 were suspensions, 8,707 were reprimands, 4,132 were categorized as ‘violation noted’, and 4,131 resulted in ‘separation’.
count(distinct_df_accused, PENALTY_CODE) %>% arrange(desc(n))
distinct_df_accused %>%
filter(!(PENALTY_CODE %in% "WORK REGULAR DAY OFF") | is.na(PENALTY_CODE)) %>%
ggplot() +
aes(x = PENALTY_CODE, fill = PENALTY_CODE) +
geom_bar() +
scale_fill_hue(direction = -1) +
labs(
x = "Penalty Type",
y = "Total Penalties",
title = "Penalty Report"
) +
theme_bw() +
theme(legend.position = "bottom")
count(distinct_df_accused, ACCUSED_ARRESTED) %>% arrange(desc(n))
count(distinct_df_accused, FINDING_CODE) %>% arrange(desc(n))
distinct_df_accused <- distinct_df_accused %>%
mutate(FINDING_CODE = ifelse(
FINDING_CODE == "ADDITIONAL INVESTIGATION REQUESTED", "ADDITIONALINFOREQ", FINDING_CODE))
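If more finding labels ever need shortening for plotting, dplyr::recode() scales better than chained ifelse() calls, since each old = new pair is one argument. The values below are illustrative:

```r
library(dplyr)

toy <- tibble(
  FINDING_CODE = c("ADDITIONAL INVESTIGATION REQUESTED", "SUSTAINED")
)

# Map long labels to short ones; unmatched values pass through unchanged
toy <- toy %>%
  mutate(FINDING_CODE = recode(
    FINDING_CODE,
    "ADDITIONAL INVESTIGATION REQUESTED" = "ADDITIONALINFOREQ"
  ))

toy$FINDING_CODE
## [1] "ADDITIONALINFOREQ" "SUSTAINED"
```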
ggplot(distinct_df_accused) +
aes(x = FINDING_CODE, fill = FINDING_CODE) +
geom_bar() +
scale_fill_hue(direction = 1) +
labs(x = "Case Finding", y = "Total Number", title = "Finding Report") +
coord_flip() +
theme_minimal()